Method of programming linear graphs for streaming vector computation

ABSTRACT

A method for producing a formatted description of a computation representable by a data-flow graph, and a computer for performing a computation so described. A source instruction is generated for each input (502, 522) of the data-flow graph, a computational instruction is generated for each node (506, 510, 514, etc.) of the data-flow graph, and a sink instruction is generated for each output (520, 540) of the data-flow graph. The computational instruction for a node includes a descriptor of the operation performed at the node and a descriptor of each instruction that produces an input to the node. The formatted description is a sequential instruction list (A, B, C, . . . , J, K, L, FIG. 2) comprising source instructions, computational instructions and sink instructions. Each instruction has an instruction identifier, and the descriptor of each instruction that produces an input to the node is that instruction's identifier. The computer includes a program memory (618) and a number of computational units (602) interconnected by an interconnection unit (604). The computer is directed by a program of instructions stored in the program memory (618) to implement a computation representable by a data-flow graph (e.g. 100). The program of instructions is generated from a sequential instruction list.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is related to co-pending patent applications titled “INTERCONNECTION DEVICE WITH INTEGRATED STORAGE” and identified by Attorney Docket No. CML00101D, “MEMORY INTERFACE WITH FRACTIONAL ADDRESSING” and identified by Attorney Docket No. CML00102D, “RE-CONFIGURABLE STREAMING VECTOR PROCESSOR” and identified by Attorney Docket No. CML00107D, and “SCHEDULER FOR STREAMING VECTOR PROCESSOR” and identified by Attorney Docket No. CML00108D, which are filed on even date herewith and are hereby incorporated herein by reference.

FIELD OF THE INVENTION

[0002] This invention relates generally to the field of vector processing. More particularly, this invention relates to a method of programming linear data-flow graphs for streaming vector computing and to a computer for implementing the resulting program of instructions.

BACKGROUND OF THE INVENTION

[0003] Many new applications being planned for mobile devices (multimedia, graphics, image compression/decompression, etc.) involve a high percentage of streaming vector computations. In vector processing, it is common for a set of operations to be repeated for each element of a vector or other data structure. This set of operations is often described by a data-flow graph. For example, a data-flow graph may be used to describe all of the operations to be performed on elements of the data structure for a single iteration of a program loop. It may be necessary to execute these operations a number of times during the processing of an entire stream of data (as in audio or video processing, for example). Computing machines that do this processing would benefit from a representation of the data-flow graph that can be executed directly.

[0004] It would also be beneficial if the representation were expressive enough for execution on a range of computing machines with different parallel processing capabilities. Consequently, the representation must be both a series of computations for linear execution on a sequential computing machine and also a list of operational dependencies within and between iterations for concurrent execution on a parallel computing machine.

[0005] In a conventional (Von Neumann) computer, a program counter (PC) is used to sequence the instructions in a program. Program flow is explicitly controlled by the programmer. Data objects (variables) may be altered by any number of instructions, so the order of the instructions cannot be altered without the risk of invalidating the computation.

[0006] In a data-flow description, data objects are described as the results of operations, so an operation cannot be performed until the data is ready. Apart from this requirement, the order in which the operations are carried out is not specified.

[0007] It is possible to represent the operations of a data-flow graph as a series of operations from a known computer instruction set, such as the instruction sets for the Intel x86 or Motorola M68K processors. However, the resulting programs are difficult to execute in a parallel manner because unnecessary dependencies often force serialization of the operations. These unnecessary dependencies arise because all results of operations must be stored in a small set of named registers before being used in subsequent operations. This creates resource contention and results in serialization, even for computing machines that have additional registers. The use of named registers to pass results also obscures the differences between data dependencies within an iteration and data dependencies between iterations. If it is known that there are no dependencies between iterations, then all iterations of a loop can be implemented simultaneously. The parallelism is limited only by the amount of resources on the computing machine.

[0008] Consequently, there is an unmet need for a method for describing a data-flow graph that represents both operational dependencies and data dependencies whilst avoiding the use of named registers.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as the preferred mode of use, and further objects and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawing(s), wherein:

[0010] FIG. 1 is an exemplary data-flow graph.

[0011] FIG. 2 is a table of a sequential instruction list in accordance with the present invention.

[0012] FIG. 3 is a table of a sequential instruction list in accordance with the present invention showing data dependencies.

[0013] FIG. 4 is a table of a sequential instruction list in accordance with the present invention showing order dependencies.

[0014] FIG. 5 is a diagrammatic representation of simultaneous execution of multiple iterations of a computation using tunnels in accordance with the present invention.

[0015] FIG. 6 is a diagrammatic representation of a computer for performing a calculation described by a sequential instruction list in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0016] While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail one or more specific embodiments, with the understanding that the present disclosure is to be considered as exemplary of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.

[0017] The present invention relates to a computer program execution format for a general-purpose machine for accelerating iterative computations on streaming vectors of data. The invention also relates to a computer for executing a program in the specified format. The format is the instruction set of a sequential data-flow processor with all of the dependencies explicitly stated to facilitate parallel execution.

[0018] Computations are conveniently represented as data-flow graphs. An exemplary data-flow graph is shown in FIG. 1. Referring to FIG. 1, the data-flow graph 100 consists of a number of external interaction blocks, A, B, C, D, K and L, and a number of computational nodes E, F, G, H, I and J. The computational nodes are also referred to as processing nodes or functional nodes. In the data-flow graph representation, the program flow is determined by the interconnections between the computational nodes and the external interaction blocks. The order in which parallel paths in the graph are executed is not specified. In FIG. 1, a first value from a data input stream is loaded at external interaction block A and a second value from the same stream is loaded at block B. The order of these two operations is important, so an order dependency is specified in the graph, as indicated by the broken arrow 102. Similarly, consecutive data values from a second input stream are loaded at external interaction blocks C and D, the order being indicated by broken arrow 104. At computational node E, the values loaded at A and C are multiplied (indicated by the mnemonic ‘vmul’). The values input as operands to the multiplication are signed, 16-bit values, as indicated by ‘s16’ on the inputs to node E. The output from node E is indicated as being a signed, 32-bit value (‘s32’). At computational node F, the values loaded at B and D are multiplied (indicated by the mnemonic ‘vmul’). The values input as operands to the multiplication are signed, 16-bit values, as indicated by ‘s16’ on the arcs connecting blocks B and D to node F. The output from node F is similarly indicated as being a signed, 32-bit value (‘s32’). Computational nodes G, H, I and J operate similarly, with the mnemonics ‘vsub’ and ‘vadd’ indicating subtraction and addition respectively. At external interaction block K, the result of the subtraction performed at node I is accumulated into a named accumulator a0. At external interaction block L, the result of the operation performed at node J is accumulated into the named accumulator a0.

[0019] If the first input stream is the interleaved real and imaginary parts of a complex vector x, and the second input stream is the interleaved real and imaginary parts of a complex vector y, then the accumulator contains the sum of the real and imaginary parts of the vector dot product x·y.

[0020] FIG. 2 shows a linear graph representation of the data-flow graph shown in FIG. 1. Each instruction is identified by an instruction descriptor. In this example, the corresponding node identifiers from FIG. 1 are used; however, this is not a requirement. The instructions A, B, C and D indicate the loading of vector elements. The linear order of the instructions denotes order dependencies in the data-flow graph representation. Multiplication instruction E includes the mnemonic ‘vmul’, indicating that the operation is a multiplication, and the operands A and C. This indicates that the operands for the multiplication operation are the results of the operations A and C (the vector load operations). Note that because order dependency is indicated by the linear order of the instructions, the result of vector load A is the first data value in the input vector and the result of vector load B is the second data value in the input vector. At the next iteration of the data-flow graph, these will be the third and fourth values respectively.

[0021] In one embodiment of the present invention, the computational instruction E is written as

[0022] E: vmul A, C

[0023] This instruction includes the identifier of the instruction (‘E’), a descriptor of the operation to be performed (‘vmul’) and the descriptors of the instructions that produce the operands for the computation (‘A’ and ‘C’).

[0024] In a further embodiment of the present invention, the computational instruction E is written as

[0025] E: vmul.s32 A, C

[0026] This instruction includes the appended descriptor ‘.s32’, indicating that the result of the operation is a signed, 32-bit value. Other descriptors include ‘s8’, ‘s16’, ‘s24’, ‘u8’ and ‘u16’, for example.
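As an illustration of the textual form shown in paragraphs [0022] and [0025], the following Python sketch splits an instruction into its identifier, mnemonic, optional result descriptor and operand identifiers. The regular expression, the `Instruction` tuple and the function name `parse_instruction` are assumptions made for this example only; the description does not prescribe a parser.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Instruction(NamedTuple):
    ident: str                   # instruction identifier, e.g. 'E'
    op: str                      # operation mnemonic, e.g. 'vmul'
    result_type: Optional[str]   # optional result descriptor, e.g. 's32'
    operands: Tuple[str, ...]    # identifiers of the producing instructions


# Hypothetical grammar: "<ident>: <mnemonic>[.<type>] <operand>, <operand>, ..."
_PATTERN = re.compile(r"^\s*(\w+)\s*:\s*(\w+)(?:\.(\w+))?\s*(.*)$")


def parse_instruction(line: str) -> Instruction:
    """Split one line of the sequential instruction list into its parts."""
    match = _PATTERN.match(line)
    if match is None:
        raise ValueError(f"not an instruction: {line!r}")
    ident, op, result_type, rest = match.groups()
    operands = tuple(tok.strip() for tok in rest.split(",") if tok.strip())
    return Instruction(ident, op, result_type, operands)


if __name__ == "__main__":
    print(parse_instruction("E: vmul A, C"))
    print(parse_instruction("E: vmul.s32 A, C"))
```

Note that the operands are instruction identifiers rather than register names, which is the property the format relies on.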

[0027] The format of the present invention uses references to previous instructions, rather than named registers, to indicate the passing of operation results (data dependencies) within an iteration. The type and size of the result and whether the result is signed or unsigned (the signedness of the result) are indicated by the producing instruction. Results that are passed between iterations are explicitly indicated by instructions that manipulate a set of named registers, called accumulators, and by instructions that manipulate a set of unnamed FIFO (First-In, First-Out) registers called tunnels.

[0028] Referring to FIG. 2, instruction K accumulates the result of instruction I into an accumulator named ‘a0’. This named accumulator is used in each iteration and at the start of the iteration it will hold the value from the previous iteration. Accumulator a0 is used again in instruction L. The linear order of instructions K and L indicates that the result from instruction I is accumulated before the result from instruction J.
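The linear execution of instructions A to L on a sequential machine can be sketched in Python as follows. FIG. 2 is not reproduced in the text, so several details here are assumptions: that G multiplies the results of A and D, that J adds the results of G and H, and that the accumulator starts at zero. The function name `run_dot_product` is likewise hypothetical. Each result is stored under the identifier of the instruction that produced it, and only a0 persists between iterations.

```python
def run_dot_product(x, y):
    """Execute the list A..L once per complex element.

    x and y hold interleaved real and imaginary parts.
    Returns the final value of accumulator a0.
    """
    a0 = 0                      # named accumulator, carried between iterations
    xs, ys = iter(x), iter(y)
    for _ in range(len(x) // 2):
        r = {}                  # results keyed by instruction identifier
        r["A"] = next(xs)       # first value of stream 1 (real part of x)
        r["B"] = next(xs)       # second value of stream 1 (imaginary part of x)
        r["C"] = next(ys)       # first value of stream 2 (real part of y)
        r["D"] = next(ys)       # second value of stream 2 (imaginary part of y)
        r["E"] = r["A"] * r["C"]    # E: vmul A, C
        r["F"] = r["B"] * r["D"]    # F: vmul B, D
        r["G"] = r["A"] * r["D"]    # assumed operands for G
        r["H"] = r["B"] * r["C"]    # H depends on B and C
        r["I"] = r["E"] - r["F"]    # vsub: real part of the product
        r["J"] = r["G"] + r["H"]    # assumed vadd: imaginary part
        a0 += r["I"]                # K: accumulate I into a0
        a0 += r["J"]                # L: accumulate J into a0, after K
    return a0


if __name__ == "__main__":
    # (1 + 2i)(3 + 4i) = -5 + 10i, so a0 = -5 + 10
    print(run_dot_product([1, 2], [3, 4]))
```

Under these assumptions the accumulator ends up holding the sum of the real and imaginary parts of the dot product, as stated in paragraph [0019].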

[0029] Thus, in the program format of the present invention, each external interaction node and each computational node is represented by an instruction. The instruction comprises an instruction identifier, an instruction mnemonic, and one or more operands. For computational instructions, the operands are the identifiers of the instructions that generate the inputs to the computation; for external interactions, the operands are the destination for input data and the source instruction and destination of output data.

[0030] Data dependencies are explicit, since the operands reference the instructions that generate the data rather than a named storage location. This is illustrated in FIG. 3. Referring to FIG. 3, the data dependencies of the linear graph are shown. The arrows point from an instruction to the prior instructions that produce the inputs for that instruction. For example, instruction H depends upon data produced by instructions B and C. Thus data dependencies are represented in the format. Operands are indicated as references to an instruction's results, thereby eliminating unnecessary contention for named registers.
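Because operands are instruction identifiers, the data dependencies can be read directly from the list. The Python sketch below builds that mapping and checks that every operand refers to an earlier instruction. The mnemonics `vld` and `vacc`, the operands of G, I and J, and the name `data_dependencies` are assumptions for illustration; the accumulator name is left out of the sink operands for brevity.

```python
# Hypothetical rendering of the instruction list: (identifier, mnemonic, operands)
PROGRAM = [
    ("A", "vld", []), ("B", "vld", []),
    ("C", "vld", []), ("D", "vld", []),
    ("E", "vmul", ["A", "C"]),
    ("F", "vmul", ["B", "D"]),
    ("G", "vmul", ["A", "D"]),   # assumed operands
    ("H", "vmul", ["B", "C"]),
    ("I", "vsub", ["E", "F"]),   # assumed operands
    ("J", "vadd", ["G", "H"]),   # assumed mnemonic and operands
    ("K", "vacc", ["I"]),        # accumulate into a0
    ("L", "vacc", ["J"]),        # accumulate into a0
]


def data_dependencies(program):
    """Map each instruction identifier to the identifiers that produce its inputs."""
    seen = set()
    deps = {}
    for ident, _mnemonic, operands in program:
        for source in operands:
            if source not in seen:
                raise ValueError(f"{ident} references {source} before it is produced")
        deps[ident] = list(operands)
        seen.add(ident)
    return deps


if __name__ == "__main__":
    for ident, sources in data_dependencies(PROGRAM).items():
        print(ident, "<-", sources)
```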

[0031] Dependencies due to the execution order of instructions that cause changes in state, called order dependencies, are indicated by the serial order of these non-independent instructions in the instruction list. FIG. 4 shows the order dependencies of the computation. The broken arrows point from the first instruction to be executed to a subsequent instruction. Order dependencies are specified independently of the data dependencies, thereby supporting simultaneous execution of multiple iterations as long as the order of state changes is maintained.
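One way to recover order dependencies from the serial order is to chain consecutive instructions that change the same piece of state. The Python sketch below does this under the assumption that each state-changing instruction can be tagged with the resource it affects; the resource names `x`, `y` and `a0` and the function name `order_dependencies` are hypothetical.

```python
# (identifier, resource whose state the instruction changes, or None)
PROGRAM = [
    ("A", "x"), ("B", "x"),      # loads advance input stream x
    ("C", "y"), ("D", "y"),      # loads advance input stream y
    ("E", None), ("F", None), ("G", None), ("H", None),
    ("I", None), ("J", None),    # pure computations change no state
    ("K", "a0"), ("L", "a0"),    # both update accumulator a0
]


def order_dependencies(program):
    """Return (earlier, later) pairs for consecutive users of the same resource."""
    last_user = {}
    edges = []
    for ident, resource in program:
        if resource is None:
            continue
        if resource in last_user:
            edges.append((last_user[resource], ident))
        last_user[resource] = ident
    return edges


if __name__ == "__main__":
    print(order_dependencies(PROGRAM))
```

For this list the pairs correspond to the broken arrows 102 and 104 of FIG. 1 and to the ordering of K before L.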

[0032] The computation is thus represented as a sequential instruction list, including a source instruction for each input of the data-flow graph, a computational instruction for each node of the data-flow graph and a sink instruction for each output of the data-flow graph. Each instruction includes an instruction identifier, and the computation instruction for a node includes a descriptor of the operation performed at the node and the identifier of each instruction that produces an input to the node. The computational instructions include arithmetic, multiplication and logic instructions. The source instructions include instructions to load data from an input data stream, load a scalar value from a store, load a value from an accumulator and retrieve a value from a tunnel. The sink instructions include instructions to add, subtract or store to an accumulator, output to an output data stream or pass to a tunnel.
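A minimal Python sketch of producing such a list from a graph is given below. It assumes the graph is supplied as a dictionary from identifier to (mnemonic, input identifiers), and it emits every producer before its consumers by a depth-first walk in declaration order. The mnemonics `vld` and `vst` and the function name `linearize` are hypothetical, and the sketch handles data dependencies only: instructions that change state would need to be declared in the required order.

```python
def linearize(graph):
    """Emit one instruction line per node, producers before consumers.

    graph: dict mapping identifier -> (mnemonic, list of input identifiers).
    """
    emitted = []
    done = set()

    def visit(ident, active):
        if ident in done:
            return
        if ident in active:
            raise ValueError(f"cycle through {ident}")
        mnemonic, inputs = graph[ident]
        for source in inputs:
            visit(source, active | {ident})
        done.add(ident)
        emitted.append(f"{ident}: {mnemonic} {', '.join(inputs)}".rstrip())

    for ident in graph:          # dicts preserve declaration order
        visit(ident, frozenset())
    return emitted


if __name__ == "__main__":
    graph = {
        "S": ("vst", ["M"]),         # sink declared first on purpose
        "M": ("vmul", ["X", "Y"]),   # computational node
        "X": ("vld", []),            # sources
        "Y": ("vld", []),
    }
    print("\n".join(linearize(graph)))
```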

[0033] In one embodiment of the present invention, tunnels are used to save a result from an operation in the current iteration while producing the result saved from a previous iteration. Tunnels indicate data flows between consecutive iterations in a graph, where the source and sink of the flow are the same point in the graph. This allows multiple iterations to be executed simultaneously, since data from one iteration can be concurrently passed to the next iteration. Accumulators, described above, cannot do this since their sources and sinks are at different points in the data-flow graph.
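The store-and-produce behaviour of a tunnel can be sketched in Python as a small FIFO. The class name `Tunnel`, the method name `exchange`, the `depth` parameter and the initial contents of zero are assumptions for this example; the description states only that tunnels are unnamed FIFO registers.

```python
from collections import deque


class Tunnel:
    """FIFO that stores the current value and produces a previously stored one."""

    def __init__(self, depth=1, initial=0):
        # depth=1 passes a value to the next iteration (assumed default)
        self._fifo = deque([initial] * depth)

    def exchange(self, value):
        """Store `value` and return the value stored `depth` iterations ago."""
        self._fifo.append(value)
        return self._fifo.popleft()


if __name__ == "__main__":
    tunnel = Tunnel()
    for sample in [5, 7, 9]:
        print(sample, "->", tunnel.exchange(sample))
```

Because storing and producing happen at a single point in the graph, one operation expresses the whole flow between iterations.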

[0034] An exemplary use of tunnels is shown in FIG. 5. In this example, two consecutive iterations of a computation are performed in parallel, with data passed from one iteration to the next via two tunnels. Referring to FIG. 5, the first iteration begins with the data element being loaded into vector v1 at external interaction block 502. The data element is passed to a first tunnel 504. The data value is stored in the tunnel and the previously stored value is produced. The previously stored value is added to the loaded data element at node 506 and also passed to a second tunnel 508. The previously stored value from tunnel 508 is added at node 510 to the result of the addition 506. At node 514 the constant value from block 512 is multiplied by the result from addition 510. At node 518 the result from multiplication 514 is right-shifted by the constant (16) stored in block 516. The result from the right-shift operation is stored to output vector v0 at external interaction block 520.

[0035] The next iteration begins with the next data element being loaded into vector v1 at external interaction block 522. The data element is passed to the first tunnel 524. The data value is stored in the tunnel and the previously stored value is produced. The previously stored value is the value stored in the tunnel by the previous iteration. In this way, data is passed between iterations. The previously stored value is added to the loaded data element at node 526 and also passed to the second tunnel 528. The previously stored value is the value stored in the tunnel by the previous iteration. The previously stored value from tunnel 528 is added at node 530 to the result of the addition 526. At node 534 the constant value from block 532 is multiplied by the result from addition 530. At node 538 the result from multiplication 534 is right-shifted by the constant (16) stored in block 536. The result from the right-shift is stored to output vector v0 at external interaction block 540.

[0036] If only two iterations are carried out in parallel, the third iteration begins at block 502, and the values retrieved from the tunnels 504 and 508 are the values stored in the second iteration. The use of tunnels therefore also allows data to be passed between iterations performed in parallel.

[0037] The data-flow graph in FIG. 5 performs a three-point moving-average of a vector of data values.
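The computation of FIG. 5 can be sketched in Python as a sequential loop with two single-value tunnels. The text does not give the constant held in block 512; the value 21846 (roughly 2^16/3, rounded up) is an assumption chosen so that the multiply followed by the 16-bit right shift approximates division by three. The function name and the zero initial tunnel contents are also assumptions.

```python
ONE_THIRD_Q16 = 21846   # assumed constant for block 512: about 65536 / 3
SHIFT = 16              # right-shift amount from block 516


def moving_average3(samples):
    """Three-point moving average using two tunnels, one iteration per sample."""
    tunnel1 = 0         # holds x[n-1] between iterations (assumed initial value)
    tunnel2 = 0         # holds x[n-2] between iterations (assumed initial value)
    output = []
    for x in samples:                         # block 502: load data element
        prev1, tunnel1 = tunnel1, x           # tunnel 504: store x, produce x[n-1]
        sum1 = x + prev1                      # node 506
        prev2, tunnel2 = tunnel2, prev1       # tunnel 508: store x[n-1], produce x[n-2]
        sum2 = prev2 + sum1                   # node 510
        product = ONE_THIRD_Q16 * sum2        # node 514
        output.append(product >> SHIFT)       # node 518, then store at block 520
    return output


if __name__ == "__main__":
    print(moving_average3([3, 6, 9, 12]))
```

The first two outputs are computed with the assumed zero initial tunnel contents, so they average over fewer real samples than the later ones.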

[0038] A program of computer instructions may be generated from the sequential instruction list. The generation may include scheduling of the instructions to make efficient use of the hardware resources of the computer. The format of the present invention allows a computation to be scheduled efficiently for linear execution on a sequential computing machine or for concurrent execution on a parallel computing machine. One embodiment of a computer that is directed by a program of instructions generated from a sequential instruction list is shown in FIG. 6. Referring to FIG. 6, the computer 600 includes a number of computational units 602 and an interconnection unit 604. In the figure, the computational units are a multiplier 606, an adder (arithmetic unit) 608, a logic unit 610 and a shifter 612. Other computational units, including multiple units of the same type, may be used. The interconnection unit 604 serves to connect the outputs of computational units to the inputs of other computational units. There are many forms of the interconnection unit 604; these include a re-configurable switch, a data memory or a register file. An accumulator 614 may also be connected to the computational units 602 via the interconnection unit 604. Data to be processed is passed to the interconnection unit 604 by data vector input unit 622, and processed data is retrieved from the interconnection unit by data vector output unit 624. In general, multiple data vector input and output units are used. The computer is directed by a program of instructions stored in the program memory 618 of sequencer 620. The instructions control the data vector input and output units, 622 and 624, the interconnection unit 604, the accumulator 614 and the computational units 602.
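To illustrate how explicit dependencies expose parallelism to a scheduler, the Python sketch below assigns each instruction the earliest step at which its inputs are available (an as-soon-as-possible levelling). This is not the scheduler of the related application; it ignores order dependencies and the number of available computational units, and the operands of G, I and J as well as the function names are assumptions.

```python
# (identifier, identifiers of the instructions that produce its inputs)
PROGRAM = [
    ("A", []), ("B", []), ("C", []), ("D", []),
    ("E", ["A", "C"]), ("F", ["B", "D"]),
    ("G", ["A", "D"]),           # assumed operands
    ("H", ["B", "C"]),
    ("I", ["E", "F"]),           # assumed operands
    ("J", ["G", "H"]),           # assumed operands
    ("K", ["I"]), ("L", ["J"]),
]


def asap_levels(program):
    """Earliest step for each instruction, given data dependencies only."""
    level = {}
    for ident, sources in program:
        # One step after the latest producer; sources start at step 0.
        level[ident] = 1 + max((level[s] for s in sources), default=-1)
    return level


def group_by_level(level):
    """Collect identifiers that could be issued in the same step."""
    steps = {}
    for ident, step in level.items():
        steps.setdefault(step, []).append(ident)
    return [steps[s] for s in sorted(steps)]


if __name__ == "__main__":
    for step, idents in enumerate(group_by_level(asap_levels(PROGRAM))):
        print(step, idents)
```

On a machine with a single multiplier, for example, the four multiplications in the same step would still have to be spread over several cycles.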

[0039] The present invention, as described in embodiments herein, is implemented using a programmed processor executing a sequential list of instructions in the format described above. However, those skilled in the art will appreciate that the processes described above can be implemented in any number of variations without departing from the present invention. Such variations are contemplated and considered equivalent.

[0040] While the invention has been described in conjunction with specific embodiments, it is evident that many alternatives, modifications, permutations and variations will become apparent to those of ordinary skill in the art in light of the foregoing description. Accordingly, it is intended that the present invention should embrace all such alternatives, modifications and variations as fall within the scope of the appended claims.

What is claimed is:
 1. A method for producing a formatted description of a computation representable by a data-flow graph, the method comprising: generating a source instruction for each input of the data-flow graph; generating a computational instruction for each node of the data-flow graph, the computation instruction for a node comprising a descriptor of the operation performed at the node and a descriptor of each instruction that produces an input to the node; generating a sink instruction for each output of the data-flow graph; and generating a sequential instruction list comprising source instructions, computational instructions and sink instructions.
 2. A method in accordance with claim 1, wherein the computation instruction further comprises descriptors of at least one of the type, size and signedness of the result of the operation performed at the node.
 3. A method in accordance with claim 1, wherein the formatted description further comprises an instruction identifier for each instruction of the sequential instruction list and wherein the descriptor of each instruction that produces an input to the node is the instruction identifier of the instruction of the sequential instruction list that produces an input to the node.
 4. A method in accordance with claim 1, wherein the source instruction is one of an instruction to load a data value from an input data vector, an instruction to load a data value from an accumulator and an instruction to load a data value from a tunnel.
 5. A method in accordance with claim 1, wherein the sink instruction is one of an instruction to store a data value to an output data vector and an instruction to store a data value to a tunnel.
 6. A method in accordance with claim 1, wherein the sink instruction is one of an instruction to put a data value into an accumulator, an instruction to add a data value to an accumulator, an instruction to subtract a data value from an accumulator, an instruction to store a data value if it is larger than a value in an accumulator and an instruction to store a data value if it is smaller than a value in an accumulator.
 7. A method in accordance with claim 1, wherein the order of instructions that change the state of the data-flow graph is determined by the order of the instructions in the sequential instruction list.
 8. A method in accordance with claim 1, wherein a computational instruction comprises one of an arithmetic instruction, a multiplier instruction, a shift instruction and a logic instruction.
 9. A method in accordance with claim 1, wherein a node has no more than three inputs.
 10. A computer comprising: a plurality of computational units; an interconnection unit for interconnecting the computational units; and a program memory; wherein the computer is directed by a program of instructions stored in the program memory to implement a computation representable by a data-flow graph, and wherein the program of instructions is generated from a sequential instruction list comprising: a source instruction for each input of the data-flow graph; a computational instruction for each node of the data-flow graph, the computation instruction for a node comprising a descriptor of the operation of the node and a descriptor of each instruction that produces an input to the node; and a sink instruction for each output of the data-flow graph.
 11. A computer in accordance with claim 10, wherein the computation instruction is executable on a computational unit of the plurality of processing units.
 12. A computer in accordance with claim 10, wherein the interconnection unit of the computer is directed by a program of instructions stored in the program memory.
 13. A computer in accordance with claim 10, wherein the interconnection unit of the computer comprises one of a data memory, a register file and a re-configurable switch.
 14. A computer in accordance with claim 10, wherein the computation instruction further comprises descriptors of at least one of the type, size and signedness of the result of the operation performed at the node.
 15. A computer in accordance with claim 10, wherein the formatted description further comprises an instruction identifier for each instruction of the sequential instruction list and wherein at least one descriptor of an instruction that produces an input to the node is the instruction identifier of the instruction of the sequential instruction list that produces an input to the node.
 16. A computer in accordance with claim 10, further comprising at least one of: a data vector input unit; a plurality of registers operable as a data tunnel; and an accumulator; wherein the source instruction is one of an instruction to load a data value from the data vector input unit, an instruction to load a data value from the accumulator and an instruction to load a data value from the data tunnel.
 17. A computer in accordance with claim 16, wherein the sink instruction is an instruction to store a data value to the data tunnel.
 18. A computer in accordance with claim 10, further comprising at least one data vector output unit, wherein the sink instruction is an instruction to store a data value to the data vector output unit.
 19. A computer in accordance with claim 10, further comprising an accumulator, wherein the sink instruction is one of an instruction to put a data value into the accumulator, an instruction to add a data value to the accumulator, an instruction to subtract a data value from the accumulator, an instruction to store a data value if it is larger than a value in the accumulator and an instruction to store a data value if it is smaller than a value in the accumulator.
 20. A computer in accordance with claim 10, wherein the order of instructions that change the state of the data-flow graph is determined by the order of these instructions in the sequential instruction list.
 21. A computer in accordance with claim 10, wherein the computational elements include one of an arithmetic unit, a multiplier and a logic unit and wherein a computational instruction comprises one of an arithmetic instruction, a multiplier instruction, a shift instruction and a logic instruction.
 22. A computer in accordance with claim 10, wherein the computational units are configurable for concurrent processing of multiple instructions.